Augmented Partial Mutual Learning with Frame Masking for Video Captioning

Authors

Abstract

Recent video captioning work has improved greatly thanks to the invention of various elaborate model architectures. If multiple models are combined into a unified framework, not merely by simple ensembling but so that each model can benefit from the others, final performance might be boosted further. Jointly training multiple models has been explored in previous works. In this paper, we propose a novel Augmented Partial Mutual Learning (APML) method in which multiple decoders are trained jointly with mimicry losses between different input variations. Another problem in video captioning is the "one-to-many" mapping: one identical video input is mapped to multiple caption annotations. To address this problem, we propose an annotation-wise frame masking approach that converts the "one-to-many" mapping into a "one-to-one" mapping. Experiments performed on the MSR-VTT and MSVD datasets demonstrate that our proposed algorithm achieves state-of-the-art performance.
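The mutual-learning idea in the abstract can be illustrated with a minimal sketch. This is a simplified illustration of the general mimicry-loss pattern, not the paper's actual APML implementation: two peer decoders each minimize cross-entropy against the ground-truth caption plus a KL term that pulls their per-step token distributions toward each other. The function and variable names here are illustrative assumptions.

```python
import numpy as np

def softmax(logits, axis=-1):
    # numerically stable softmax over the vocabulary axis
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q), averaged over time steps
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1).mean()

def mutual_learning_losses(logits_a, logits_b, targets):
    """Cross-entropy plus symmetric mimicry (KL) terms for two peer decoders.

    logits_a, logits_b: (T, V) per-step vocabulary logits from each decoder.
    targets: (T,) ground-truth caption token ids.
    Returns the per-decoder training losses (loss_a, loss_b).
    """
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    steps = np.arange(targets.shape[0])
    ce_a = -np.log(p_a[steps, targets] + 1e-12).mean()
    ce_b = -np.log(p_b[steps, targets] + 1e-12).mean()
    # each decoder also mimics the other's soft output distribution
    loss_a = ce_a + kl_divergence(p_b, p_a)
    loss_b = ce_b + kl_divergence(p_a, p_b)
    return loss_a, loss_b
```

When the two decoders agree exactly, the KL terms vanish and each loss reduces to plain cross-entropy; during joint training, the mimicry terms exchange the soft "dark knowledge" between peers that simple ensembling would discard.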


Related Articles

Deep Learning for Video Classification and Captioning

Accelerated by the tremendous increase in Internet bandwidth and storage space, video data has been generated, published and spread explosively, becoming an indispensable part of today's big data. In this paper, we focus on reviewing two lines of research aiming to stimulate the comprehension of videos with deep learning: video classification and video captioning. While video classification con...


Video captioning with recurrent networks based on frame- and video-level features and visual content classification

In this paper, we describe the system for generating textual descriptions of short video clips using recurrent neural networks, which we used while participating in the Large Scale Movie Description Challenge 2015 in ICCV 2015. Our work builds on static image captioning system proposed in [22] and implemented in [8] and extends this framework to videos utilizing both static image features and v...


Video Captioning via Hierarchical Reinforcement Learning

Video captioning is the task of automatically generating a textual description of the actions in a video. Although previous work (e.g. sequence-to-sequence model) has shown promising results in abstracting a coarse description of a short video, it is still very challenging to caption a video containing multiple fine-grained actions with a detailed description. This paper aims to address the cha...


End-to-End Video Captioning with Multitask Reinforcement Learning

Although end-to-end (E2E) learning has led to promising performance on a variety of tasks, it is often impeded by hardware constraints (e.g., GPU memories) and is prone to overfitting. When it comes to video captioning, one of the most challenging benchmark tasks in computer vision and machine learning, those limitations of E2E learning are especially amplified by the fact that both the input v...


Reconstruction Network for Video Captioning

In this paper, the problem of describing visual contents of a video sequence with natural language is addressed. Unlike previous video captioning work mainly exploiting the cues of video contents to make a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture, which leverages both the forward (video to sentence) and backward (...



Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2021

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v35i3.16301